Abstract

Recent work has established an empirically successful framework for adapting learning rates for
stochastic gradient descent (SGD). This effectively removes the need for tuning, while
automatically reducing learning rates over time on stationary problems and permitting learning
rates to grow appropriately on non-stationary tasks. Here, we extend the idea in three directions:
addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal
gradients, and improving robustness on non-smooth loss functions. In the process, we replace the
diagonal Hessian estimation procedure, which may not always be available, by a robust
finite-difference approximation. The final algorithm integrates all these components, has linear
complexity and is hyper-parameter free.



1 Introduction 

Many machine learning problems can be framed as minimizing a loss function over a large (maybe 
infinite) number of samples. In representation learning, those loss functions are generally built 
on top of multiple layers of non-linearities, precluding any direct or closed-form optimization, but 
admitting (sample) gradients to guide iterative optimization of the loss. 

Stochastic gradient descent (SGD) is among the most broadly applicable and widely used algorithms
for such learning tasks, because of its simplicity, robustness and scalability to arbitrarily
large datasets. Doing many small but noisy updates instead of fewer large ones (as in batch
methods) both gives a speed-up and makes the learning process less likely to get stuck in
sensitive local optima. In addition, SGD is eminently well-suited for learning in non-stationary
environments, e.g., when the data stream is generated by a changing environment; but
non-stationary adaptivity is useful even on stationary problems, as the initial search phase
(before a local optimum is located) of the learning process can be likened to a non-stationary
environment.

Given the increasingly wide adoption of machine learning tools, there is an undoubted benefit to 
making learning algorithms, and SGD in particular, easy to use and hyper-parameter free. In recent 
work, we made SGD hyper-parameter free by introducing optimal adaptive learning rates that are 
based on gradient variance estimates [1]. While broadly successful, the approach was limited to
smooth loss functions, and to minibatch sizes of one. In this paper, we therefore complement that 
work, by addressing and resolving the issues of 

• minibatches and parallelization, 

• sparse gradients, and 

• non-smooth loss functions 

all while retaining the optimal adaptive learning rates. All of these issues are of practical
importance: minibatch parallelization has strongly diminishing returns, but in combination with
sparse gradients and adaptive learning rates, we show how that effect is drastically mitigated.
Robustly dealing with non-smooth loss functions is also a very practical concern: a growing number
of learning architectures employ non-smooth nonlinearities, like absolute value normalization or
rectified-linear units. Our final algorithm addresses all of these, while remaining simple to
implement and of linear complexity.

2 Background 

There are a number of adaptive settings for SGD learning rates, or equivalently, diagonal
preconditioning schemes, to be found in the literature, e.g., [2, 3, 4, 5, 6, 7]. The aim of those
is generally to increase performance on stochastic optimization tasks, a concern complementary to
our focus of producing an algorithm that works robustly without any hyper-parameter tuning. Often
those adaptive schemes produce monotonically decreasing rates, however, which makes them no longer
applicable to non-stationary tasks.

The remainder of this paper builds upon the adaptive learning rate scheme of [1], which is not
monotonically decreasing, so we recapitulate its main results here. Using an idealized quadratic
and separable loss function, it is possible to derive an optimal learning rate schedule which
preserves the convergence guarantees of SGD. When the problem is approximately separable, the
analysis is simplified as all quantities are one-dimensional. The analysis also holds as a local
approximation in the non-quadratic but smooth case.

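In the idealized case, and for any dimension $i$, the optimal learning rate can be derived
analytically, and takes the following form:

$$\eta_i^* = \frac{1}{h_i} \cdot \frac{(\theta_i - \theta_i^*)^2}{(\theta_i - \theta_i^*)^2 + \sigma_i^2} = \frac{1}{h_i} \cdot \frac{(\mathbb{E}[\nabla_{\theta_i}])^2}{\mathbb{E}[\nabla_{\theta_i}^2]} \qquad (1)$$

where $(\theta_i - \theta_i^*)$ is the distance to the optimal parameter value, and $\sigma_i^2$ and
$h_i$ are the local sample variance and curvature, respectively.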

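We use an exponential moving average with time-constant $\tau_i$ (the approximate number of samples
considered from recent memory) for online estimates of the quantities in equation 1:

$$\overline{g_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{g_i} + \tau_i^{-1} \cdot \nabla_{\theta_i}$$
$$\overline{v_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{v_i} + \tau_i^{-1} \cdot (\nabla_{\theta_i})^2$$
$$\overline{h_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{h_i} + \tau_i^{-1} \cdot h_i^{\text{bbprop}}$$

where the diagonal Hessian entries $h_i^{\text{bbprop}}$ are computed using the 'bbprop' procedure [8],
and the time-constant (memory) is adapted according to how large a step was taken:

$$\tau_i(t+1) = \left(1 - \frac{\overline{g_i}(t)^2}{\overline{v_i}(t)}\right) \cdot \tau_i(t) + 1$$

The final algorithm is called vSGD, and uses the learning rates from equation 1 to update the
parameters (element-wise):

$$\theta_i \leftarrow \theta_i - \eta_i^* \cdot \nabla_{\theta_i}$$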
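The recapitulated vSGD update can be sketched for a single parameter as follows. This is a minimal illustration rather than the reference implementation: the function name `vsgd_step`, the `state` dictionary, the `eps` guard, and the idea that the curvature `curv` is handed in by the caller (standing in for the bbprop estimate) are all assumptions made for this example.

```python
def vsgd_step(theta, grad, curv, state, eps=1e-5):
    """One element-wise vSGD update for a single parameter (illustrative sketch).

    state holds the moving averages g, v, h and the memory size tau.
    curv stands in for the bbprop diagonal Hessian entry (assumed to be given).
    """
    inv_tau = 1.0 / state["tau"]
    # Exponential moving averages of gradient, squared gradient and curvature.
    state["g"] = (1 - inv_tau) * state["g"] + inv_tau * grad
    state["v"] = (1 - inv_tau) * state["v"] + inv_tau * grad ** 2
    state["h"] = (1 - inv_tau) * state["h"] + inv_tau * curv
    # Learning rate of equation 1, with moving averages replacing expectations.
    eta = state["g"] ** 2 / (state["h"] * state["v"] + eps)
    # Memory shrinks after large steps (g^2 close to v) and grows otherwise.
    state["tau"] = (1 - state["g"] ** 2 / (state["v"] + eps)) * state["tau"] + 1
    return theta - eta * grad, eta
```

In the noise-free case (squared mean gradient equal to mean squared gradient) the rate approaches 1/h, so a quadratic loss is solved in roughly one step; with noise the ratio drops below one and the step shrinks.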



3 Parallelization with minibatches 

Compared to pure online SGD, computation time can be reduced by "minibatch" parallelization:
$n$ sample gradients are computed (simultaneously, e.g., on multiple cores) and then a single
update on the resulting averaged minibatch gradient is performed:

$$\nabla_\theta = \frac{1}{n} \sum_{j=1}^{n} \nabla_\theta^{(j)} \qquad (2)$$

Figure 1: Diminishing returns of minibatch parallelization. Plotted is the relative log-loss gain
(per number of sample gradients evaluated) of a given minibatch size compared to the gain of the
n = 1 case (in the noisy quadratic scenario from section 2, for different noise levels σ, and
assuming optimal learning rates as in equation 3); each figure corresponds to a different sparsity
level. For example, the ratio is 0.02 for n = 100 (left plot, low noise): this means that it takes
50 times more samples to obtain the same gain in loss as with pure SGD. Those are strongly
diminishing returns, but they are less drastic if the noise level is high (only 5 times more
samples in this example). If the sample gradients are somewhat sparse, however, and we use that
fact to increase learning rates appropriately, then the diminishing returns kick in only for much
larger minibatch sizes; see the left two figures.

While $n$ can be seen as a hyperparameter of the algorithm [9], it is often constrained to a large
extent by the computational hardware, memory requirements and communication bandwidth. A
derivation just like the one that led to equation 1 can be used to determine the optimal learning
rates automatically, for an arbitrary minibatch size $n$. The key difference is that the averaging
in equation 2 reduces the effective variance by a factor $n$, leading to:

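$$\eta_i^* = \frac{1}{h_i} \cdot \frac{(\theta_i - \theta_i^*)^2}{(\theta_i - \theta_i^*)^2 + \frac{1}{n}\sigma_i^2} = \frac{1}{h_i} \cdot \frac{(\mathbb{E}[\nabla_{\theta_i}])^2}{\frac{1}{n}\mathbb{E}[\nabla_{\theta_i}^2] + \frac{n-1}{n}(\mathbb{E}[\nabla_{\theta_i}])^2} \qquad (3)$$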

This expresses the intuition that using minibatches reduces the sample noise, in turn permitting
larger step sizes: if the noise (or sample diversity) is small, those gains are minimal; if it is
large, they are substantial (see Figure 1, left). Varying minibatch sizes tend to be impractical¹
to implement, however, and so common practice is to simply fix a minibatch size, and then re-tune
the learning rates (by a factor between 1 and $n$). With our adaptive minibatch-aware scheme
(equation 3) this is no longer necessary: in fact, we get an automatic transition from initially
small effective minibatches (by means of the learning rates) to large minibatches toward the end,
when the noise level is higher.
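To make the minibatch-aware rate concrete, the sketch below evaluates it in two ways: from moment estimates of the gradient, and from the quantities of the noisy quadratic model. The function names are hypothetical, and the relations g = h·(θ − θ*) and v = h²·((θ − θ*)² + σ²) used in the usage note are an assumption about the idealized quadratic model of [1].

```python
def minibatch_rate(g, v, h, n):
    """Minibatch-aware rate from moment estimates g ~ E[grad], v ~ E[grad^2]."""
    return (1.0 / h) * n * g ** 2 / (v + (n - 1) * g ** 2)


def minibatch_rate_model(dist, sigma2, h, n):
    """The same rate from the distance to the optimum and the noise variance."""
    return (1.0 / h) * dist ** 2 / (dist ** 2 + sigma2 / n)
```

For example, with h = 2, distance 3 and σ² = 4 (so g = 6 and v = 52 under the assumed model), both functions give the same value for any n; for n = 1 the expression reduces to the single-sample rate, and as n grows it approaches the noise-free limit 1/h.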

4 Sparse gradients 

Many common learning architectures (e.g., those using rectified linear units, or sparsity
penalties) lead to sample gradients that are increasingly sparse, that is, they are non-zero only
in a small fraction of the problem dimensions. It is possible to exploit this to speed up
learning, by averaging many sparse gradients in a minibatch, or by doing asynchronous updates [10].

Here, we investigate how to set the learning rates in the presence of sparsity, and our result is simply 
based on the observation that doing an update using a set of sparse gradients is equivalent to doing 
the same update, but with a smaller effective minibatch size, while ignoring all the zero entries. 

We can do this again on an element-by-element basis, where we define $z_i$ to be the number of
zero elements in dimension $i$ within the current minibatch, so that $n - z_i$ entries are
non-zero. In each dimension, we rescale the minibatch gradient accordingly by a factor
$n/(n - z_i)$, and at the same time reduce the learning rate to reflect the smaller effective
minibatch size. Compounding those two effects gives the optimal learning rate for sparse
minibatches (we ignore the case $z_i = n$, when there is no update):

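$$\eta_i^* = \frac{n}{n - z_i} \cdot \frac{1}{h_i} \cdot \frac{(\mathbb{E}[\nabla_{\theta_i}])^2}{\frac{1}{n - z_i}\mathbb{E}[\nabla_{\theta_i}^2] + \frac{n - z_i - 1}{n - z_i}(\mathbb{E}[\nabla_{\theta_i}])^2} \qquad (4)$$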

Figure 1 shows how using minibatches with such adaptive learning rates reduces the impact of
diminishing returns if the sample gradients are sparse. In other words, with the right learning
rates, higher sparsity can be directly translated into higher parallelizability.
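The compounding of the two effects (gradient rescaling and smaller effective minibatch size) can be sketched as below. The function name is hypothetical, `z` counts the zero entries of the current minibatch in one dimension, and treating `g` and `v` as moment estimates of the non-zero sample gradients is an assumption of this example.

```python
def sparse_minibatch_rate(g, v, h, n, z):
    """Sparse-minibatch learning rate for one dimension (illustrative sketch).

    z is the number of zero entries in this dimension of the current minibatch;
    g and v are assumed to be moment estimates of the non-zero sample gradients.
    """
    if z >= n:
        return 0.0  # every sample gradient is zero here, so there is no update
    m = n - z  # effective minibatch size
    rescale = n / m  # compensates for averaging over n instead of m samples
    return rescale * (1.0 / h) * m * g ** 2 / (v + (m - 1) * g ** 2)
```

With z = 0 this reduces to the dense minibatch rate, and with z = n - 1 it is n times the single-sample rate, so the single non-zero sample gradient is applied at full strength despite the 1/n averaging.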



¹ If the implementation/computational architecture is flexible enough, the variance-term of the
learning rate can also be used to adapt the minibatch size to its optimal trade-off.
Figure 2: Difference between global or instance-based computation of effective minibatch sizes in
the presence of sparse gradients. Our proposed method computes the number of non-zero entries
($n - z_i$) in the current minibatch to set the learning rate (green). This involves some
additional computation compared to just using the long-term average sparsity $p_i^{(nz)}$ (red),
but obtains a substantially higher relative gain (see Figure 1), especially in the regime where
the sparsity level produces minibatches with just one or a few non-zero entries (dent in the curve
near $n = 1/p_i^{(nz)}$). If the noise level is low (left two figures), the effect is much more
pronounced than if the noise is higher. For comparison, the performance of 40 different fixed
learning-rate SGD settings (between 0.01 and 100) is plotted as yellow dots.




Figure 3: Illustrating the effect of reweighting minibatch gradients. Assume the samples are drawn 
from 2 different noisy clusters (yellow and light blue vectors), but one of the clusters has a higher 
probability of occurrence. The regular minibatch gradient is simply their arithmetic average (red), 
dominated by the more common cluster. The reweighted minibatch gradient (blue) does a full step 
toward each of the clusters, closely resembling the gradient one would obtain by performing a hard 
clustering (difficult in practice) on the samples, in dotted green. 



An alternative to computing $z_i$ for each minibatch (and each dimension) anew would be to just
use the long-term average sparsity $p_i^{(nz)} = \mathbb{E}[n - z_i]$ instead. Figure 2 shows that
this is suboptimal, especially if the noise level is small, and in the regime where each minibatch
is expected to contain just a few non-zero entries. This figure also shows that equation 4
produces a higher relative gain compared to the outer envelope of the performance of all fixed
learning rates.
Figure 4: Illustrating the expectation over non-smooth sample losses. In dotted blue, the loss
functions for a few individual samples are shown, each a non-smooth function. However, the
expectation over a distribution of such functions is smooth, as shown by the thick magenta curve.
Left: absolute value, right: rectified linear function; samples are identical but offset by a
value drawn from $\mathcal{N}(0, 1)$.



4.1 Orthogonal gradients 



One reason for the boost in parallelizability if the gradients are sparse comes from the fact that 
sparse gradients are mostly orthogonal, allowing independent progress in each direction. But sparse 
gradients are in fact a special case of orthogonal gradients, for which we can obtain similar speedups 
with a reweighting of the minibatch gradients: 



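$$\nabla_\theta = \sum_{j=1}^{n} \frac{1}{\sum_{k=1}^{n} \frac{\left| \nabla_\theta^{(j)\top} \nabla_\theta^{(k)} \right|}{\|\nabla_\theta^{(j)}\| \cdot \|\nabla_\theta^{(k)}\|}} \, \nabla_\theta^{(j)} \qquad (5)$$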



In other words, each sample is weighted by one over the number of times (smoothed) that its gradient 
is interfering (non-orthogonal) with another sample's gradient. 

In the limit, this scheme simplifies to the sparse-gradient cases discussed above: if all sample
gradients are aligned, they are averaged (reweighted by 1/n, corresponding to the dense case in
equation 3), and if all sample gradients are orthogonal, they are summed (reweighted by 1,
corresponding to the maximally sparse case $z_i = n - 1$ in equation 4). See Figure 3 for an
illustration.

In practice, this reweighting comes at a certain cost, increasing the computational expense of a
single iteration from $O(nd)$ to $O(n^2 d)$, where $d$ is the problem dimension. In other words,
it is only likely to be viable if the forward-backward passes of the gradient computation are
non-trivial, or if the minibatch size is small.
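The reweighting and its quadratic cost in the minibatch size can be illustrated with the following sketch. It assumes that the smoothed interference count of a sample is the sum of absolute cosine similarities between its gradient and every gradient in the minibatch (including itself); the function name and the `eps` guard for zero gradients are likewise assumptions of this example.

```python
import math


def reweighted_minibatch_gradient(grads, eps=1e-12):
    """Sum sample gradients, each divided by its smoothed interference count.

    The interference count is assumed to be the sum of absolute cosine
    similarities with all gradients in the minibatch (itself included).
    """
    norms = [math.sqrt(sum(x * x for x in g)) for g in grads]
    dim = len(grads[0])
    total = [0.0] * dim
    for j, gj in enumerate(grads):
        if norms[j] < eps:
            continue  # a zero gradient contributes nothing
        interference = 0.0
        for k, gk in enumerate(grads):  # O(n^2 d) pairwise comparisons
            if norms[k] < eps:
                continue
            dot = sum(a * b for a, b in zip(gj, gk))
            interference += abs(dot) / (norms[j] * norms[k])
        for i in range(dim):
            total[i] += gj[i] / interference
    return total
```

Four identical gradients `[1.0, 0.0]` are averaged to `[1.0, 0.0]`, whereas the two orthogonal gradients `[1.0, 0.0]` and `[0.0, 2.0]` are summed to `[1.0, 2.0]`, matching the two limiting cases described above.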



5 Non-smooth losses 

Many commonly used non-linearities (rectified linear units, absolute value normalization, etc.)
produce non-smooth sample loss functions. However, when optimizing over a distribution of samples
(or just a large enough dataset), the variability between samples can lead to a smooth expected
loss function, even though each sample has a non-smooth contribution. Figure 4 illustrates this
point for samples that have an absolute value or a rectified linear contribution to the loss.
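As a worked instance of this smoothing, consider the absolute-value case with sample losses $f^{(j)}(\theta) = |\theta - \xi^{(j)}|$ and offsets $\xi^{(j)} \sim \mathcal{N}(0, 1)$; the unit scale and unit variance are assumptions chosen for illustration, and $\Phi$ and $\varphi$ denote the standard normal distribution function and density.

```latex
\begin{align*}
\frac{d}{d\theta}\,\mathbb{E}_\xi\big[\,|\theta-\xi|\,\big]
  &= \mathbb{E}_\xi\big[\operatorname{sign}(\theta-\xi)\big]
   = P(\xi<\theta) - P(\xi>\theta)
   = 2\,\Phi(\theta) - 1, \\
\frac{d^2}{d\theta^2}\,\mathbb{E}_\xi\big[\,|\theta-\xi|\,\big]
  &= 2\,\varphi(\theta)
   = \sqrt{\tfrac{2}{\pi}}\; e^{-\theta^2/2} \;>\; 0 .
\end{align*}
```

The expected loss therefore has strictly positive curvature everywhere, although each sample loss has zero curvature wherever it is differentiable.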

It is clear from this observation that, if the sample losses are non-smooth, it is not possible to
reliably estimate the curvature of the true expected loss function from the curvature of the
individual sample losses (which are all zero in the two examples above). This means that our
previous approach of estimating the $h_i$ term in the optimal learning rate expression by a moving
average of sample curvatures, as estimated by the "bbprop" procedure [8] (which computes a
Gauss-Newton approximation of the diagonal Hessian, at the cost of one additional backward pass),
is limited to smooth sample loss functions, and we need a different approach for the general
case.²



² This also alleviates potential implementation effort, e.g., when using third-party software that
does not implement bbprop.
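Algorithm 1: vSGD-fd: minibatch-SGD with finite-difference-estimated adaptive learning rates
repeat
    draw $n$ samples, compute the gradients $\nabla_\theta^{(j)}$ for each sample $j$
    compute the gradients on the same samples, with the parameters shifted by $\delta_i = \overline{g_i}$
    for $i \in \{1, \ldots, d\}$ do
        compute finite-difference curvatures $h_i^{fd(j)} = \left| \frac{\nabla_{\theta_i}^{(j)} - \nabla_{\theta_i + \delta_i}^{(j)}}{\delta_i} \right|$
        if $|\nabla_{\theta_i} - \overline{g_i}| > 2\sqrt{\overline{v_i} - \overline{g_i}^2}$ or $|h_i^{fd} - \overline{h_i^{fd}}| > 2\sqrt{\overline{v_i^{fd}} - (\overline{h_i^{fd}})^2}$ then
            increase memory size for outliers $\tau_i \leftarrow \tau_i + 1$
        end
        update moving averages
            $\overline{g_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{g_i} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} \nabla_{\theta_i}^{(j)}$
            $\overline{v_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{v_i} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} (\nabla_{\theta_i}^{(j)})^2$
            $\overline{h_i^{fd}} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{h_i^{fd}} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} h_i^{fd(j)}$
            $\overline{v_i^{fd}} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{v_i^{fd}} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} (h_i^{fd(j)})^2$
        estimate learning rate $\eta_i \leftarrow \frac{\overline{h_i^{fd}}}{\overline{v_i^{fd}}} \cdot \frac{n \cdot (\overline{g_i})^2}{\overline{v_i} + (n-1) \cdot (\overline{g_i})^2}$
        update memory size $\tau_i \leftarrow \left(1 - \frac{(\overline{g_i})^2}{\overline{v_i}}\right) \cdot \tau_i + 1$
        update parameter $\theta_i \leftarrow \theta_i - \eta_i \cdot \frac{1}{n} \sum_{j=1}^{n} \nabla_{\theta_i}^{(j)}$
    end
until stopping criterion is met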



5.1 Finite-difference curvature 

A good estimate of the relevant curvature for our purposes (i.e., for determining a good learning
rate) is not the true Hessian at the current point, but the expectation over noisy
finite-difference steps, where those steps are on the same scale as the actually performed update
steps, because this is the regime we care about.

In practice, we obtain these finite-difference estimates by computing two gradients of the same
sample loss, at points differing by the typical update distance³:

$$h_i^{fd} = \left| \frac{\nabla_{\theta_i} - \nabla_{\theta_i + \delta_i}}{\delta_i} \right| \qquad (6)$$

where $\delta_i = \overline{g_i}$. This approach is related to the diagonal Hessian preconditioning
in SGD-QN [11], but the step-difference used is different, and the moving average scheme there is
decaying with time, which thus loses the suitability for non-stationary problems.
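A minimal sketch of the finite-difference curvature for a single sample follows. The callable `grad_fn(theta, xi)`, the two example gradients (a quadratic and an absolute-value loss with scale `a`) and the `eps` guard against division by zero are hypothetical names and choices introduced for illustration.

```python
def fd_curvature(grad_fn, theta, delta, xi, eps=1e-5):
    """Finite-difference curvature of one sample loss (illustrative sketch).

    grad_fn(theta, xi) is a hypothetical per-sample gradient function.
    """
    diff = grad_fn(theta, xi) - grad_fn(theta + delta, xi)
    return abs(diff) / (abs(delta) + eps)


def grad_quad(theta, xi, a=3.0):
    """Gradient of the smooth sample loss 0.5 * a * (theta - xi)^2."""
    return a * (theta - xi)


def grad_abs(theta, xi, a=3.0):
    """(Sub)gradient of the non-smooth sample loss a * |theta - xi|."""
    d = theta - xi
    return a * ((d > 0) - (d < 0))
```

For the quadratic sample loss the estimate recovers the curvature `a` regardless of the step. For the absolute-value loss it is zero when the step stays on one side of the kink and positive when the step crosses it, so its average over samples is non-zero even though every sample Hessian vanishes.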

5.2 Curvature variability 

To further increase robustness, we reuse the same intuition that originally motivated vSGD, and
take into account the variance of the curvature estimates (produced by the finite-difference
method). This reduces the likelihood of becoming overconfident (underestimating curvature, i.e.,
overestimating learning rates); we do so with a variance normalization based on the
signal-to-noise ratio of the curvature estimates.



³ Of course, this estimate does not need to be computed at every step, which can save computation time.
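For this purpose we maintain two additional moving averages:

$$\overline{h_i^{fd}} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{h_i^{fd}} + \tau_i^{-1} \cdot h_i^{fd}$$
$$\overline{v_i^{fd}} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{v_i^{fd}} + \tau_i^{-1} \cdot (h_i^{fd})^2$$

and then compute the curvature term simply as $h_i = \overline{v_i^{fd}} / \overline{h_i^{fd}}$.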

5.3 Outlier detection 

If an outlier sample is encountered while the time constant $\tau_i$ is close to one (i.e., the
history is mostly discarded from the moving averages at each update), this has the potential to
disrupt the optimization process. Here, the statistics we keep for the adaptive learning rates
have an additional, unforeseen benefit: they make it trivial to detect outliers.

The outlier's effect can be mitigated relatively simply by increasing the time-constant before
incorporating the sample into the statistics (to make sure old samples are not forgotten); then,
because the perceived variance shoots up, the learning rate is automatically reduced. If it was
not an outlier, but a genuine change in the data distribution, the algorithm will quickly adapt
and increase the learning rates again.

In practice, we use a detection threshold of two standard deviations, and increase the
corresponding $\tau_i$ by one (see pseudocode).
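The two-standard-deviation test can be sketched as follows for one dimension. The function and argument names are hypothetical; `g`, `v`, `h_bar` and `v_fd` stand for the moving averages of the gradient, squared gradient, finite-difference curvature and squared finite-difference curvature, and clipping the variances at zero is an added safeguard rather than part of the description above.

```python
import math


def is_outlier(grad, h_fd, g, v, h_bar, v_fd):
    """Flag a sample whose gradient or curvature deviates by more than two
    standard deviations from the running statistics (illustrative sketch)."""
    grad_std = math.sqrt(max(v - g ** 2, 0.0))
    curv_std = math.sqrt(max(v_fd - h_bar ** 2, 0.0))
    return abs(grad - g) > 2 * grad_std or abs(h_fd - h_bar) > 2 * curv_std


def memory_after_outlier_check(tau, grad, h_fd, g, v, h_bar, v_fd):
    """Increase the memory size by one when an outlier is detected."""
    return tau + 1 if is_outlier(grad, h_fd, g, v, h_bar, v_fd) else tau
```

Because the memory is enlarged before the sample enters the moving averages, a single extreme sample is diluted by the retained history instead of overwriting it.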

5.4 Algorithm 

Algorithm 1 gives the explicit pseudocode for this finite-difference estimation, in combination
with the minibatch size-adjusted rates from equation 3, termed "vSGD-fd". Initialization is akin
to that of vSGD, in that all moving averages are bootstrapped on a few samples (10) before any
updates are done. It is also wise to add a tiny $\epsilon = 10^{-5}$ term where necessary to avoid
divisions by zero.
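Putting the pieces together, a one-dimensional sketch of vSGD-fd might look as follows. It is an illustration under stated assumptions, not the reference implementation: `grad_fn(theta, xi)` and `sample_fn()` are hypothetical callables, the initial memory size is set to the number of bootstrap samples, a zero shift is replaced by `eps`, and sparsity handling and gradient reweighting are left out.

```python
import random


def vsgd_fd(grad_fn, theta, sample_fn, n=10, steps=200, eps=1e-5, n_init=10):
    """Sketch of vSGD-fd for a single parameter (d = 1)."""
    # Bootstrap all moving averages on a few samples before any update.
    init = [sample_fn() for _ in range(n_init)]
    grads = [grad_fn(theta, xi) for xi in init]
    g = sum(grads) / n_init
    v = sum(x * x for x in grads) / n_init
    delta = g if abs(g) > eps else eps
    curvs = [abs(grad_fn(theta, xi) - grad_fn(theta + delta, xi)) / abs(delta)
             for xi in init]
    h_bar = sum(curvs) / n_init
    v_fd = sum(c * c for c in curvs) / n_init
    tau = float(n_init)  # assumed initial memory size

    for _ in range(steps):
        batch = [sample_fn() for _ in range(n)]
        grads = [grad_fn(theta, xi) for xi in batch]
        delta = g if abs(g) > eps else eps  # shift by the typical update scale
        shifted = [grad_fn(theta + delta, xi) for xi in batch]
        curvs = [abs(a - b) / abs(delta) for a, b in zip(grads, shifted)]
        grad = sum(grads) / n
        h_fd = sum(curvs) / n

        # Outlier test (two standard deviations): enlarge the memory first.
        if (abs(grad - g) > 2 * max(v - g * g, 0.0) ** 0.5
                or abs(h_fd - h_bar) > 2 * max(v_fd - h_bar * h_bar, 0.0) ** 0.5):
            tau += 1

        # Moving averages of gradient and curvature statistics.
        w = 1.0 / tau
        g = (1 - w) * g + w * grad
        v = (1 - w) * v + w * sum(x * x for x in grads) / n
        h_bar = (1 - w) * h_bar + w * h_fd
        v_fd = (1 - w) * v_fd + w * sum(c * c for c in curvs) / n

        # Minibatch-aware learning rate with variance-normalized curvature.
        eta = (h_bar / (v_fd + eps)) * n * g * g / (v + (n - 1) * g * g + eps)
        tau = (1 - g * g / (v + eps)) * tau + 1
        theta -= eta * grad
    return theta
```

A possible call on a noisy quadratic with unit curvature is `vsgd_fd(lambda th, xi: th - xi, 10.0, lambda: random.gauss(0.0, 1.0))`, which starts at θ = 10 and needs no learning-rate argument.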



6 Simulations 



An algorithm that has the ambition to work out-of-the-box, without any tuning of hyper-parameters, 
must be able to pass a number of elementary tests: those may not be sufficient, but they are necessary. 
To that purpose, we set up a collection of elementary (one-dimensional) stochastic optimization test 
cases, varying the shape of the loss function, its curvature, and the noise level. The sample loss 
functions are 



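$$f_{\text{quad}} = A \cdot (\theta - \xi^{(j)})^2$$
$$f_{\text{abs}} = A \cdot |\theta - \xi^{(j)}|$$
$$f_{\text{rectlin}} = \begin{cases} A \cdot (\theta - \xi^{(j)}) & \text{if } \theta - \xi^{(j)} > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$f_{\text{gauss}} = A - A \, e^{\,\cdots}$$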



where $A$ is the curvature setting and the $\xi^{(j)}$ are drawn from $\mathcal{N}(0, \sigma^2)$. We
vary curvature and noise levels by two orders of magnitude, i.e., $A \in \{0.1, 1, 10\}$ and
$\sigma^2 \in \{0.1, 1, 10\}$, giving us 9 × 4 test cases. To visualize the large number of
results, we summarize each test case and algorithm combination in a concise heatmap square (see
Figure 5 for the full explanation).

In Figure 6, we show the results for all test cases on a range of algorithms and minibatch sizes
$n$. Each square shows the gain in loss for 100 independent runs of 1024 updates each. Each group
of columns corresponds to one of the four functions, with the 9 inner columns using different
curvature and noise level settings. Color scales are identical for all heatmaps within a column,
but not across columns. Each group of rows corresponds to one algorithm, with each row using a
different hyper-parameter setting, namely initial learning rates
$\eta_0 \in \{0.01, 0.1, 1, 10\}$ (for SGD, AdaGrad [6] and the natural gradient [7]) and decay
rate $\gamma \in \{0, 1\}$ for SGD. All rows come in pairs, with the upper one using pure SGD
(n = 1) and the lower one using minibatches (n = 10).
Figure 5: Explanation of how to read our concise heatmap performance plots (right), based on the
more common representation as learning curves (left). In the learning curve representation, we
plot one curve for each algorithm and each trial (3 × 8 total), with a unique color/line-type per
algorithm, and the mean performance per algorithm with more contrast. Performance is measured
every power of 2 iterations. This gives a good idea of the progress, but quickly becomes hard to
read. On the right side, we plot the identical data in heatmap format. Each square corresponds to
one algorithm, the horizontal axis is still the iterations (on a $\log_2$ scale), and on the
vertical axis we arrange (sort) the performance of the different trials at the given iteration.
The color scale is as follows: white is the initial loss value, the stronger the blue, the lower
the loss, and if the color is reddish, the algorithm overshot to loss values that are bigger than
the initial one. Good algorithm performance is visible when the square becomes blue on the right
side, instability is marked in red, and the variability of the algorithm across trials is visible
by the color range on the vertical axis.



The findings are clear: in contrast to the other algorithms tested, vSGD-fd does not require any 
hyper-parameter tuning to give reliably good performance on the broad range of tests: the learning 
rates adapt automatically to different curvatures and noise levels. And in contrast to the predecessor 
vSGD, it also deals with non-smooth loss functions appropriately. The learning rates are adjusted 
automatically according to the minibatch size, which improves convergence speed on the noisier test 
cases (3 left columns), where there is a larger potential gain from minibatches. 

The earlier variant (vSGD) was shown to work very robustly on a broad range of real-world
benchmarks and non-convex, deep neural network-based loss functions. We expect those results on
smooth losses to transfer directly to vSGD-fd. This bodes well for future work that will determine
its performance on real-world non-smooth problems.



7 Conclusion 



We have presented a novel variant of SGD with adaptive learning rates that expands on previous 
work in three directions. The adaptive rates properly take into account the minibatch size, which 
in combination with sparse gradients drastically alleviates the diminishing returns of parallelization. 
Also, the curvature estimation procedure is based on a finite-difference approach that can deal with 
non-smooth sample loss functions. The final algorithm integrates these components, has linear
complexity and is hyper-parameter free. Unlike other adaptive schemes, it works on a broad range 
of elementary test cases, the necessary condition for an out-of-the-box method. 

Future work will investigate how to adjust the presented element-wise approach to highly
non-separable problems (tightly correlated gradient dimensions), potentially relying on a low-rank
or block-decomposed estimate of the gradient covariance matrix, as in TONGA [12].
Figure 6: Performance comparisons for a number of algorithms (row groups) under different setting
variants (rows) and sample loss functions (columns), the latter grouped by loss function shape.
Red tones indicate a loss value worsening from its initial value, white corresponds to no
progress, and darker blue tones indicate a reduction of loss (in log-scale). For a detailed
explanation of how to read the heatmaps, see Figure 5. The newly proposed algorithm vSGD-fd
(bottom row group) performs well across all functions and noise-level settings, notably fixing the
vSGD instability on non-smooth functions like the absolute value. The other algorithms need to
have their hyper-parameters tuned to the task to work well.